(54) Method for building speech and/or language recognition models



(57) A method for building the speech and/or language
models (41, 42) which are initially loaded in a multimodal
handheld device (4), such as a PDA, a mobile
phone or a toy. The speech and language models are 
used for speech recognition. The method comprises the 
following steps: 

said speaker selects (406) on a visual user interface
of said multimodal handheld device one or several
speaker clusters to which said speaker belongs,
said multimodal handheld device (4) sends (408) 
said selected clusters to a remote server (1), 
cluster-dependent speech and/or language models 
(41, 42) are downloaded (412) from said remote 
server (1) to said multimodal handheld device (4). 
Description

[0001] The present invention concerns a method for 
building speech and/or language recognition models. In 
particular, the present invention concerns a method 
adapted to multimodal handheld devices such as PDAs 
(personal digital assistants), mobile phones, PDAs com- 
bined with mobile phones, and toys. 
[0002] PDAs have established themselves as a convenient
replacement for paper diaries. Many PDAs also
offer a wide range of software applications and services
for people on the move. Modern PDAs include a GPRS
(General Packet Radio Service), UMTS (Universal Mobile
Telecommunication System) or Wireless LAN IEEE
802.11b interface and are therefore capable of being
always connected to a telecommunication network
("always on").

[0003] For a broad market acceptance, PDAs should 
be kept small and light-weight. This implies severe re- 
strictions on the input means that can be integrated. The 
present solutions include for example soft keyboards or 
keypads or pen-based handwriting recognition systems. 
While pen-based selection in a visual menu is conven- 
ient, getting longer texts into a PDA is hardly practicable. 
[0004] Speech recognition based systems have al- 
ready been used with success as a complementary in- 
put mode to enhance and complement the pen-based 
interface. Speech is very natural and easy to use;
furthermore, the text acquisition speed is much higher than
with most soft keyboards or pen-based input systems.
For car drivers, speech recognition provides a hands-free
and eyes-free interface.

[0005] Experiences with existing systems show that 
speech input is an ideal complement for pen-based 
handheld devices; people tend to prefer speech for en- 
tering data and pen or buttons for corrections and point- 
ing. 

[0006] Most existing speech recognition modules in 
handheld devices use neural networks or Hidden
Markov Models (HMMs) with speaker-independent 
speech recognition models. For each speech element 
in a dictionary of words to be recognised, a speech mod- 
el is stored in the handheld device. Predefined speech 
models are stored in the system during the installation 
of the speech recognition module. Speech elements can 
include for example words, subwords, phonemes, triphones,
polyphones or sentences.
[0007] Updates or additions of new speech elements 
in current speaker-independent speech recognition sys- 
tems are impossible or at least cumbersome. Moreover, 
the speech recognition performance of speaker-independent
speech recognition systems usually remains
poor and the vocabulary of words accessible by voice
is limited.

[0008] Some handheld devices achieve better per- 
formance by using speaker-dependent speech models, 
which are usually learned during an enrolment and/or 
training session. During this phase, the speaker has to
teach his handheld device how he pronounces words
and commands. New words, for example new commands
or new contact names or personal contacts in an
agenda, can be inputted and must be trained. The error
rate of speaker-dependent speech recognition systems
is usually much lower than for speaker-independent
speech recognition systems. However, many users view
the time spent on the enrolment session as a burden,
which is a serious barrier to broad market acceptance.
[0009] Adaptative speech recognition methods have
been proposed for applications other than recognition
in handheld devices. Adaptative methods
remove the need for an enrolment or training session
and still reach very high recognition rates. In those systems,
initially available speaker-independent speech
models are progressively adapted to the specific voice
and language of the speaker, using speech elements,
words and sentences spoken by the speaker during normal
working sessions. The adaptation may be guided
(the speaker has to correct the recognition system each
time a word has been wrongly recognized) or unguided
(the recognition system corrects itself and improves the
speech models automatically).

[0010] Adaptative speech recognition systems are
widely used, for example in speech recognition software
for personal desktop computers. However, for different
reasons, they are only poorly adapted to speech recognition
in handheld devices. It can be shown that the performance
of adaptative speech recognition methods in
handheld devices, in particular the initial recognition rate
before adaptation, remains poor.
[0011] A first reason is that storage capacity in
handheld devices, for example in PDAs, mobile phones
or toys, is scarce in comparison with desktop computers.
Consequently, the vocabulary of trained words or
speech elements for which a model is initially available
should be kept small. Speech elements that do not belong
to the initial dictionary are not recognised; users of
a new, untrained system still have to create models for
all the speech elements absent from the initial set.
[0012] Besides, as speech models built from utterances
of large groups of speakers tend to require more
storage space, the speech models initially loaded in
handheld devices are often created from smaller data
sets, which also decreases the initial recognition performance.

[0013] As a consequence, the recognition rate of a
new, not yet adapted speech recognition module in a
handheld device is often so poor that many users who
only give it a few tries are discouraged from using it.
[0014] One aim of the invention is therefore to build 
more easily, more quickly and with less user intervention 
improved, pre-adapted speech and/or language models 
for speech recognition in a handheld device. 
[0015] Another aim of the invention is to improve the
recognition rate of speech recognition modules in hand- 
held devices, more particularly before the adaptation. 
[0016] Another aim of the invention is to adapt the 
speech and/or language models available in a handheld
device without necessitating an enrolment session. 
[0017] In accordance with one embodiment of the 
present invention, those aims are achieved with a meth- 
od for constructing speech and/or language recognition 
models during a speaker enrolment and/or training 
phase for a multimodal handheld device, wherein the 
speaker first selects on a visual user interface of said 
device one or several speaker clusters to which said 
speaker belongs. The handheld device sends said se- 
lected clusters to a remote server. The remote server 
computes or retrieves cluster-dependent speech and/or 
language models for speech elements in a cluster-de- 
pendent dictionary that are downloaded to said device. 
[0018] One advantage is that the speech and/or lan- 
guage models initially loaded in the handheld device, 
and the dictionary of speech elements for which models 
have been built, are already pre-adapted to the charac- 
teristics of the user and are therefore close to the user 
specifications. The initial recognition rate that can be 
reached with the models downloaded in the handheld 
device prior to the first use of its speech recognition 
module is therefore higher than in known enrolment-less 
systems in which speaker-independent models are initially
used.

[0019] Furthermore, as the speech and/or language 
models are built from voice data gathered from speakers
belonging to the same clusters as the user of the hand- 
held device, high recognition rates can eventually be 
achieved even with small speech and/or language mod- 
els built from small sets of speakers. 
[0020] Different types of speaker clusters can be de- 
fined. Best results will be reached if the speakers in each 
cluster have similar speech characteristics or models.
Preferably, one will use clusters which the speakers will 
have no difficulties in selecting for themselves. Exam- 
ples of suitable clusters may include sex, age, mother 
tongue, geographic location or origin, weight, education 
level and/or main professional activity of the speaker. 
Obviously, one speaker may choose to belong to different
independent clusters in order to further improve the initial
recognition rate before adaptation. For example,
one speaker may define himself as a man, aged between
30 and 40, having English as a mother tongue,
born in Geneva, weighing between 70 and 80 kg,
working as a marketing assistant.
[0021] If a speech sample has already been recorded, 
the system can help the user to select the right cluster, 
or check if the selection of categories is appropriate.
[0022] The invention will be better understood with the 
help of the description of a specific embodiment illus- 
trated by the figures in which: 

Fig. 1 shows a diagram of a first embodiment of a 
telecommunication system according to the inven- 
tion. 

Fig. 2 shows a second embodiment of a telecommunication
system according to the invention.

Fig. 3 shows a third embodiment of a telecommu- 
nication system according to the invention. 

Fig. 4 shows a fourth embodiment of a telecommu- 
nication system according to the invention. 

Fig. 5 shows a fifth embodiment of a telecommuni- 
cation system according to the invention. 

Fig. 6 is a flowchart of the method of the invention. 

Fig. 7 illustrates one visual interface that can be 
used for implementing the method of the invention
with a handheld device. 



[0023] Fig. 1 shows a diagram of a telecommunication
system according to the invention. It includes a remote
server 1 operated by a service provider, for example by
a mobile network operator, by an internet service provider
or by a software solution provider, and serving a
plurality of users (or subscribers) using multimodal
handheld devices 4. The remote server includes a database
10 for previously stored speech data units, i.e. a
collection of audio data corresponding to speech elements,
each preferably uttered by several hundreds or
thousands of persons. According to the invention, clusters
(as later defined) to which the speaker of each audio
data unit belongs are linked in the database 10 with the
corresponding speech data units.
[0024] A speech and/or language-computing module
11 is provided for computing speech and/or language
models corresponding to possible combinations of clusters
in the database 10. The module 11 can send a query
to the database 10 for retrieving all the speech data units
corresponding to a selected combination of speaker
clusters. For example, the module 11 can retrieve all
speech data units recorded by female speakers less
than 20 years old. Speech and language models that can
be used by a speaker-independent or adaptative speech
recognition module are computed from those speech data units.
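
The retrieval of speech data units for a combination of clusters can be sketched as follows. This is only an illustration under stated assumptions: the `SpeechDataUnit` record, the `select_units` function and the cluster label strings are hypothetical names chosen for the example and are not defined in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechDataUnit:
    """Hypothetical record of database 10: one utterance plus speaker clusters."""
    speaker_id: str
    element: str          # speech element uttered, for example a word
    clusters: frozenset   # cluster labels linked to the speaker


def select_units(database, selected_clusters):
    """Return the units whose speaker belongs to every selected cluster."""
    wanted = frozenset(selected_clusters)
    return [unit for unit in database if wanted <= unit.clusters]


# Assumed sample content; real data would be recorded audio, not labels only.
database = [
    SpeechDataUnit("s1", "call", frozenset({"sex:female", "age:<20"})),
    SpeechDataUnit("s2", "call", frozenset({"sex:male", "age:30-40"})),
    SpeechDataUnit("s3", "stop", frozenset({"sex:female", "age:<20"})),
]

# Query corresponding to "female speakers less than 20 years old".
units = select_units(database, {"sex:female", "age:<20"})
```

The subset test (`wanted <= unit.clusters`) expresses that a speaker must belong to all selected clusters at once; the models would then be computed from `units` only.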

[0025] In the illustrated embodiment, the server 1 further
comprises an Internet server, for example an http
or https server, for connecting it to the Internet 2. Internet
users can download speech and/or language models
computed by the module 11 by connecting to a corresponding
page on the https server 12, as will be described
later. Speech samples can be uploaded for automatically
suggesting clusters or verifying selected
clusters.

[0026] The system of the invention comprises a plurality
of multimodal handheld devices 4; for the sake of
simplicity, only some components of a single device 4
are illustrated in Figure 1. In this description and in the
claims, multimodal means that the handheld device 4
comprises at least two different input means 40 for inputting
commands or data, one of those means comprising
a microphone and a preferably adaptative
speech recognition module 43 for inputting spoken commands
and texts. Depending on the device, the other
input means 40 can comprise, for example, a pen-based
interface with a handwriting recognition module, a keypad,
a soft keyboard, buttons, joysticks, etc., or any combination
of those means.

[0027] In this first embodiment, the handheld device
4 comprises a radio interface 44 for connecting it to a
mobile data communication network 3. The network 3
can for example be a wireless LAN, for example an IEEE
802.11b LAN, a GSM (Global System for Mobile
Communications), GPRS (General Packet Radio Service),
HSCSD (High Speed Circuit Switched Data), EDGE or
UMTS (Universal Mobile Telecommunication System)
network; different data communication layers, for
example WAP (Wireless Application Protocol) or i-Mode,
can be established over this network. The device 4 can
connect itself to the Internet 2 through the mobile
network 3 and through an Internet Service Provider 20.
The interface module 44 can be internal, for example if
the handheld device consists of a mobile phone, or can
be connected to the handheld device 4 over an IrDA,
PC Card, Bluetooth, WLAN, serial or direct connection.
[0028] The speech recognition module 43 preferably
comprises a software application executed by the general-purpose
microprocessor (not shown) of the handheld
device for providing speech recognition services to
other applications. This module 43 is preferably written
in an object-oriented language, for example C++, IDL or
JAVA (registered mark of SUN Microsystems), preferably
using distributed components, for example using
CORBA or a web services description language, but can
also comprise modules written in an iterative language.
It preferably uses adaptative Hidden Markov Models
(HMMs) or adaptative neural networks for recognising
speech elements spoken by the user of the handheld
device and for converting them into commands and/or
text input in various applications. The module 43 uses
adaptative speech models 41 and adaptative language
models 42. Speech models define the way each speech
element is pronounced by the speaker. Language models
define typical expressions and syntaxes often used
by the speaker. The speech recognition module 43 uses
the models 41, 42 to recognise speech elements, including
words, commands and sentences spoken by the user;
furthermore, it continuously adapts those models, in
a guided or preferably unguided way, in order to improve
the recognition rate reached by the user of the handheld
device.

[0029] The one skilled in the art will understand that,
if a plurality of users share the same device 4, a corresponding
plurality of speech and language models 41,
42 can be defined and stored in the same handheld device
4.

[0030] Figure 2 illustrates a second embodiment of
the system of the invention. Features which are identical
to those of Figure 1 share the same reference numbers
and will not be described again. In this embodiment, the
illustrated handheld device 4 lacks the module for directly
connecting it to a cellular data network; instead, it can
be connected over an interface 45 to a terminal 5, for
example a personal computer, which is itself connected
to the Internet 2. In a preferred embodiment, the connection
between the handheld device 4 and the terminal
5 is achieved by inserting the device 4 in a cradle (not
shown) connected to the terminal 5 over a USB link,
for example. Other ways of connecting a handheld device,
for example a PDA or a mobile phone, to a personal
computer can easily be conceived by the one skilled in
the art. As a non-exhaustive list of examples, we will cite
a direct USB link, a wireless short-range interface (for
example an infrared or Bluetooth interface), a serial interface,
a wireless LAN (WLAN) according to 802.11b
or HomeRF, etc.

[0031] Figure 3 illustrates a third embodiment of the
system of the invention. Features which are identical to
those of Figure 1 and/or 2 share the same reference
numbers and will not be described again. In this embodiment,
the speech and language models are retrieved
from a server 5, for example a personal computer, connected
with the multimodal handheld device 4 over a
temporary connection, for example a serial link, or a
permanent connection, for example in a local LAN. In a
preferred embodiment, the speech and/or language models 10
are computed by the server 5 using a software application 11
stored on a storage medium 50, for example a floppy or
an optical storage medium. The software application uses
a database 10 for speech data units that can be
stored together with the application on the storage medium
50 or, as an alternative, retrieved over the Internet.
If the database 10 is locally available, this embodiment
is particularly suitable for users who lack a fast Internet
connection. Alternatively, it would also be possible to retrieve
the parameters for speech recognition corresponding
to the selected clusters from a storage medium
directly connected to the multimodal handheld device.

[0032] The performance, including the processing
power and the storage space, of many handheld devices
is usually lower than that delivered by personal
computers or servers in a LAN environment. In some
cases, this performance may be insufficient for efficiently
implementing a software-based speech recognition
algorithm. Figure 4 illustrates a fourth embodiment
of the invention that addresses this problem. Features
that are identical to those of the preceding figures share
the same reference numbers and will not be described
again. In this embodiment, the multimodal handheld device
4 is permanently connected over a wireless local
area network 9 to a local server 7. The speech recognition
module 43 and the speech and/or language models
41, 42 initially downloaded from a remote server 1 are
stored and run by the server 7. In this embodiment,
speech recognition services requested by the multimodal
handheld device 4 are delivered by the locally connected
server 7. Distributed programming protocols, including
web services, may be used for programming this
embodiment.

[0033] Figure 5 illustrates a fifth embodiment of the
system of the invention. Features that are identical to
those of the preceding figures share the same reference
numbers and will not be described again. In this embodiment,
a complete speech recognition system is stored
and run by a local server connected to the multimodal
handheld device 4 over a wireless LAN 9, as in the fourth
embodiment. However, a simpler speech recognition
module 43', and/or a smaller set of speech models 41',
and/or a smaller set of language models 42', is stored
and can be run by the multimodal handheld device 4. In
this embodiment, the speech recognition system 41', 42'
and 43' of the handheld device is used when there is no
access to the server 7, or when this system is reliable
enough for the current application, and the more powerful
speech recognition system 41, 42, 43 of the server
7 is used when a more reliable speech recognition is
needed.
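
The choice between the on-device system 41', 42', 43' and the server-side system 41, 42, 43 can be sketched as a small decision function. This is a minimal sketch under assumptions: the function name, the two confidence parameters and the idea of comparing a numeric confidence against a per-application requirement are hypothetical; the description only states the qualitative rule.

```python
def choose_recognizer(server_reachable, local_confidence, required_confidence):
    """Pick which recognition system handles the current request.

    server_reachable    -- True if server 7 can be reached over the wireless LAN
    local_confidence    -- assumed reliability estimate of the on-device system
    required_confidence -- assumed reliability needed by the current application
    """
    if not server_reachable:
        # No access to the server: only the simpler local system is usable.
        return "local"
    if local_confidence >= required_confidence:
        # The local system is reliable enough for this application.
        return "local"
    # Otherwise use the more powerful system run by the server.
    return "server"
```

For example, a simple command menu might set a low `required_confidence` and stay on the device, whereas free-text dictation might set a higher one and be routed to the server whenever it is reachable.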

[0034] The one skilled in the art will recognize that it is
also possible to have the speech recognition module 43
and/or the models 41, 42 in a server accessible over a
public network, for example over the Internet, when a
fast permanent connection is available.
[0035] Figure 6 is a flowchart illustrating a preferred 
embodiment of the method of the invention. In a first step 
400, the speech recognition application module is 
launched in the handheld device. This may occur for ex- 
ample because the user starts an application on his 
handheld requiring speech recognition, or because the 
user has pressed a special speech button 40 (Figure 7) 
for turning on the microphone 46, recording the voice 
and converting it to text. 

[0036] During step 402, the speech recognition appli- 
cation module checks if speech and language models 
are available. This should always be the case, except 
before the first use of the speech recognition module. 
[0037] If the speech and language models 41, 42 are
available, the process continues with step 414 during
which the recorded speech is converted to text or commands
using known speech analysing methods and the
available models 41, 42. The converted text or command
is sent to the requesting application.
[0038] During step 416, which can be combined with 
step 414, the speech and language models 41, 42 are
adapted to the speech and language of the user of the 
handheld device 4. Adaptation may be performed using 
known speech models adaptation methods and may be 
guided (i.e. the user acknowledges or corrects the text 
converted by the speech conversion module) or unguid- 
ed (the module detects and automatically adapts the 
speech and/or language models in order to improve the 
recognition confidence rate). When all the spoken 
speech elements have been recognised, the speech 
recognition application module ends at step 418. 



[0039] If the module detects during step 402 that
speech and/or language models are not yet available
for the user currently logged in, it continues with the
speaker clusters selection step 404. This step may include
the execution of a dedicated speaker clusters selection
routine in the handheld device 4, or the connection
of the handheld device 4 to a speaker clusters
web, WAP or i-mode page on https server 12. In another
embodiment, the speaker clusters selection step is
performed with the terminal 5, for example with a dedicated
application in the terminal 5 or by accessing a dedicated
web page on https server 12 with terminal 5.
[0040] During step 406, the speaker actually selects
speaker clusters to which he believes he belongs. This
step is illustrated in Figure 7. Various types of speaker
clusters are presented to the speaker on the display 47
of the handheld device 4. In this example, the available
clusters include sex, age, mother tongue, geographical
origin, weight, education level, main professional activity.
The speaker may also select during this step the language
of the desired speech models (not shown). The
one skilled in the art will have no difficulties in imagining
other types of predefined clusters for subdividing the
speakers into different classes sharing common speech
or language properties. For each cluster type, the user
can enter or select in a menu on the visual interface of
his multimodal handheld device values corresponding
to his cluster. In a preferred embodiment, clusters are
selected with a pen-based interface directly on the display
47 of the handheld. Other selection means may be
available. Clusters can be suggested by the server 1 if
the user's speech samples have been previously uploaded.
[0041] During step 408, the selected clusters are sent
to the remote server 1. If the values have been selected
on a web page of the server 12, this step simply involves
sending a completed web form to the server 12. If the
values have been selected with a dedicated routine in the
handheld device 4, this step may include sending a message,
for example an email or SOAP message, to the remote
server 1 or establishing a connection with it.
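
As an illustration of step 408, the selected cluster values could be serialised as a web form body before being sent. The sketch below is an assumption, not the format used by the invention: the field names and the `encode_cluster_form` helper are hypothetical, and the description does not specify any encoding. Only the Python standard library function `urllib.parse.urlencode` is used.

```python
from urllib.parse import urlencode


def encode_cluster_form(clusters):
    """Encode selected cluster values as an URL-encoded form body.

    clusters -- mapping of hypothetical cluster type names to selected values.
    Keys are sorted so that the same selection always gives the same body.
    """
    return urlencode(sorted(clusters.items()))


# Assumed selection made in step 406.
body = encode_cluster_form({"sex": "male", "age": "30-40"})
```

The resulting string could be the body of an HTTP POST to the page on server 12; an email or SOAP message, as mentioned above, would carry the same values in a different envelope.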

[0042] When it receives the list of clusters selected by
the user of a handheld device, the remote server 1
sends a query to the speech database 10 for retrieving
the speech data of the set of speakers corresponding to
the selected combination of clusters. It then computes
speech models and/or language models corresponding
to the selected set of speakers. In order to compute efficient
speech models, the set preferably includes at
least several dozens, if possible several hundreds, of different
speakers. If storage space is an issue, the dictionary
of speech elements for which models are built
also depends on the selected combination of speaker
clusters. For instance, it would be possible to download
a dictionary of words including a different professional
jargon for each professional category selected. The server 1
may also compute or retrieve different probabilities of
occurrence associated with each speech element, for example
with each word, and/or with each sequence of
speech elements, depending on the selected cluster.
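
Cluster-dependent probabilities of occurrence can be illustrated with simple relative word frequencies. This is a minimal sketch, assuming that text transcripts of the utterances of the selected speakers are available; the function name and the sample transcripts are hypothetical, and a real language model would typically also cover sequences of speech elements.

```python
from collections import Counter


def occurrence_probabilities(transcripts):
    """Estimate the probability of each word as its relative frequency.

    transcripts -- iterable of text strings spoken by speakers of one
                   selected combination of clusters (assumed input).
    """
    counts = Counter(word for text in transcripts for word in text.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Assumed transcripts for one professional category.
probabilities = occurrence_probabilities(["call office", "call home"])
```

Running the same function on the transcripts of another cluster combination gives a different table, which is one way the downloaded dictionary and its probabilities could depend on the selection.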
[0043] If the speech recognition module of the handheld
device uses specific grammars, for example for retrieving
dates or other formatted data units, the set of
downloaded grammars may also depend on the selected
clusters. In an embodiment, it is also possible to
adapt the parameters needed for the speech features
extraction to each cluster, for example to extract different
features from a woman's speech sample than from
a man's in order to take into account different psychoacoustic
parameters.

[0044] If speech samples from the user are available, 
the server 1 may check if the clusters selected by the 
user are appropriate. 

[0045] The handheld device 4 waits during step 410
for an answer from the remote server 1. A time limit may
be set after which the selected clusters will be sent again
or the process will be stopped.
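
The waiting step 410 with its time limit can be sketched as a bounded retry loop. The names `send`, `wait_for_answer`, `max_attempts` and `time_limit_s` are assumptions introduced for the example; the description does not state how many times the clusters are resent or how long the limit is.

```python
def request_with_retries(send, wait_for_answer, max_attempts=3, time_limit_s=10.0):
    """Send the selected clusters and wait for the server's answer.

    send            -- callable that transmits the selected clusters (step 408)
    wait_for_answer -- callable taking a time limit in seconds and returning
                       the answer, or None if the limit expired (step 410)
    Returns (answer, attempts_used); answer is None if the process is stopped.
    """
    for attempt in range(1, max_attempts + 1):
        send()
        answer = wait_for_answer(time_limit_s)
        if answer is not None:
            return answer, attempt
        # Time limit reached: the loop sends the clusters again.
    # All attempts timed out: the process is stopped.
    return None, max_attempts
```

Both outcomes named in the paragraph are covered: the clusters are sent again after each expired time limit, and the process stops once `max_attempts` is exhausted.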

[0046] If a positive answer is received, the speech
recognition module 43 in the handheld device 4 downloads
during step 412 the parameters for the speech
recognition corresponding to the combination of clusters
selected by the speaker. Those parameters preferably
include speech and/or language models made of one or
several data files in a format depending on the speech
recognition module used by the handheld device 4.
[0047] As the speakers selected for constructing the
downloaded speech and language models belong to the
same clusters as the user of the multimodal handheld
device 4, the downloaded initial set already allows an
acceptable recognition rate of speech elements spoken
by this user, even if the number of speakers used for
building this speech model is limited.
[0048] A speech and/or language models check can
be performed to verify the downloaded models (not
shown). The user has to speak a few sentences, for example
sentences displayed on the visual interface of his
handheld device 4, which the speech recognition module
tries to recognize with the downloaded models. If the
recognition rate is too low, or if the recognition confidence
level is below a predefined threshold, the models
are rejected and the speech recognition module
prompts the user to select another combination of clusters.
The system may also suggest better or more appropriate
clusters, or use a standard speech model.
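
The models check can be sketched as a comparison between the sentences displayed to the user and the sentences recognised with the downloaded models. This is an illustrative sketch only: the function name, the sentence-level exact-match criterion and the `min_rate` default are assumptions, since the description leaves the recognition rate measure and the threshold value open.

```python
def models_accepted(expected, recognized, min_rate=0.75):
    """Accept the downloaded models if enough test sentences are recognised.

    expected   -- sentences displayed to the user and read aloud
    recognized -- what the speech recognition module returned for each one
    min_rate   -- assumed predefined threshold on the recognition rate
    """
    if not expected:
        raise ValueError("at least one test sentence is required")
    correct = sum(1 for e, r in zip(expected, recognized) if e == r)
    return correct / len(expected) >= min_rate
```

When the function returns `False`, the module would prompt the user to select another combination of clusters, or fall back to a standard speech model as described above.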
[0049] In another embodiment (not shown), the check
of the speech and language models built from the selected
combination of speaker clusters is performed in
the remote server 1 prior to the downloading of said
speech and language models. For this, speech data recorded
from the speaker have to be sent to the remote
server 1.

[0050] Once the speech and language models have
been accepted, the speech recognition application module
goes on with the already described steps 414 to 418
for recognizing speech elements and commands uttered
by the user.
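
The overall flow of steps 400 to 418 can be summarised in a short control-flow sketch. All callables passed to `run_session` are hypothetical placeholders for the operations described above (cluster selection, download, recognition, adaptation); the sketch only shows the order in which they are invoked and omits the waiting step 410 and the models check.

```python
def run_session(models, utterances, select_clusters, download_models,
                recognize, adapt):
    """Run one speech recognition session following the flowchart.

    models is None before the first use of the speech recognition module.
    Returns the (possibly adapted) models and the recognised texts.
    """
    if models is None:                            # step 402: models available?
        clusters = select_clusters()              # steps 404 and 406
        models = download_models(clusters)        # steps 408 and 412
    results = []
    for utterance in utterances:
        text = recognize(models, utterance)       # step 414
        models = adapt(models, utterance, text)   # step 416
        results.append(text)
    return models, results                        # step 418: end
```

On later sessions the returned `models` would be passed back in, so that the selection and download branch is skipped and only recognition and adaptation run.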

[0051] In all the above-described embodiments, the
remote server 1 is made of a central server accessible
by a plurality of users over the Internet. The one skilled
in the art will understand that the same functions can
also be performed by a server, for example by the personal
desktop computer 5, in the vicinity of the user. It
would be possible for example to store the speech data
10 and the speech and/or language computing module
11 on an optical storage disk sold with the speech recognition
application module 44 and loaded in the computer
5 for providing the same functionality. This allows
an easier downloading of speech and language models
for users who do not have an Internet connection, but
has the drawback that later adaptations of the contents of
the database 10, for example in order to include new
speech elements or new commands corresponding for
example to new software applications available, are
much harder to distribute to a large community of users.
[0052] Furthermore, the one skilled in the art will un- 
derstand that different portions of the speech and lan- 
guage models can be downloaded from different remote 
servers 1. This allows for example a software provider
to distribute over his web site pre-adapted speech and 
language models corresponding to the command words 
needed for running his software. It may also be possible 
to have different servers for different languages of the 
speech models. 



Claims 

1. A method for building speech and/or language models
(41, 42) used for recognition of speech spoken
into a multimodal handheld device (4), said method
comprising the following steps:

a speaker selects (406) on a visual user interface
(47) of said multimodal handheld device
(4) one or several speaker clusters to which
said speaker belongs,

cluster-dependent parameters (41, 42) for the
speech recognition are computed and used
(412) for said speech recognition.

2. The method of claim 1, wherein said parameters for
the speech recognition include cluster-dependent
speech and/or language models (41, 42).

3. The method of one of the claims 1 or 2, wherein said 
parameters for the speech recognition are down- 
loaded from a remote server (1). 

4. The method of claim 3, wherein said parameters for
the speech recognition are downloaded into said
device (4) over one of the following networks: GSM,
GPRS, HSCSD, UMTS, WLAN.

5. The method of one of the claims 1 or 2, wherein said
parameters for the speech recognition (41, 42) are
retrieved from a storage medium (50) non-permanently
connected to said multimodal handheld device.

6. The method of one of the claims 1 or 2, wherein said
parameters for the speech recognition (41, 42) are
stored in a server (7) permanently accessible from
said multimodal handheld device.

7. The method of claim 1, wherein said speaker categories
comprise at least two categories chosen
among the following:

sex 

age

mother tongue 

geographic location or origin 

weight 

education level 

main professional activity,

said speech and/or language models being com- 
puted from speech data of speakers corresponding 
to the selected combination of categories. 



8. The method of one of the claims 1 to 7, further comprising
the following steps:

said speaker speaks in order to check if the selected
cluster-dependent parameters for the
speech recognition (41, 42) are indeed appropriate.

9. The method of claim 8, said sentences being sent to
said remote server (1) for performing a remote
check prior to said downloading step.

10. The method of one of the claims 1 to 7, further comprising
a step of using available speech data from
the speaker to suggest some clusters to said speaker.

11. The method of one of the claims 1 to 10, further
comprising a step of adapting (416) said parameters
for the speech recognition (41, 42) to said
speaker.

12. The method of one of the claims 2 to 11, wherein
the list of speech elements for which speech and/or
language models (41, 42) are made available depends
on said selected clusters.

13. The method of one of the claims 1 to 12, wherein
said remote server (1) selects pre-recorded speech
data units (10) corresponding to speakers belonging
to the selected combination of clusters in order
to compute on-the-fly said cluster dependent
speech and/or language models (41, 42).

14. The method of one of the claims 1 to 13, said clusters
being selected prior to the first use of the
speech recognition module (44).

15. The method of one of the claims 1 to 14, said clusters
being selected on a web page in said multimodal
handheld device (4).

16. The method of one of the claims 1 to 15, said clusters
being selected with a dedicated speaker clusters
selection routine in said multimodal handheld
device (4).

17. The method of one of the claims 1 to 16, said clusters
being selected with a pen-based interface on
said multimodal handheld device (4).

18. A server, comprising:

means for connecting it to a public telecommunication
network (2),

a database (10) containing speech data units,
each speech data unit being linked to several
clusters to which the speaker of said data unit
belongs,

computing means (11) for computing parameters
for the speech recognition from a subset of
speech data records in said database corresponding
to various combinations of clusters.

19. The server of claim 18, further comprising an internet
server (12) over which said computed parameters
for the speech recognition can be downloaded.

20. The server of claim 19, wherein said Internet server
includes at least one page with which remote users
can define a combination of clusters for which parameters
for the speech recognition have to be
computed.